Privacy on Reddit? Towards Large-scale User Classification

Authors

  • Benjamin Fabian
  • Annika Baumann
  • Marian Keil
Abstract

Reddit is a social news website that aims to provide user privacy by encouraging users to adopt pseudonyms and by refraining from any kind of personal data collection. However, users are often unaware that a great deal of information about them can be gathered indirectly by analyzing their contributions and behaviour on the site. To investigate the feasibility of large-scale user classification with respect to the attributes social gender and citizenship, this article provides and evaluates several data mining techniques. First, a large text corpus is collected from Reddit and annotations are derived using lexical rules. Then, a discriminative approach to classification using support vector machines is undertaken and extended by using topics generated by latent Dirichlet allocation as features. Based on supervised latent Dirichlet allocation, a new generative model is drafted and implemented that captures Reddit's specific structure of organizing information exchange. Finally, the presented techniques for user classification are evaluated and compared in terms of classification performance as well as time efficiency. Our results indicate that large-scale user classification on Reddit is feasible, which may raise privacy concerns among its community.

1 Introduction

The popularity of social media and social networking has risen continuously since their first appearance. Users generate massive amounts of content on Facebook, Twitter, and similar social networking sites as well as on blogs, video sharing platforms, etc. Many companies are searching for opportunities in analyzing social media entries (Kaplan and Haenlein, 2010). These range from automatic processing of product reviews and opinions, especially in connection with the social components of online retailers (Oelke et al., 2009; Popescu and Etzioni, 2007), to predictions of the financial market (Ferguson et al., 2009) and even mining companies' market structures (Netzer et al., 2012).
The social sciences also increasingly use the information contained in social websites. Political movements such as the Arab Spring (Lotan et al., 2011) or Occupy Wall Street (Tremayne, 2014), for example, have been studied with the help of content analysis of social media. Many of these analyses require knowledge of certain demographics such as the origin of the authors of messages. However, many users might begin to share less information about their profiles as anonymity and privacy have gained more attention in recent years. In particular, the so-called NSA leak, in which highly confidential documents about online user surveillance by the US
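
As an illustration of the annotation step mentioned in the abstract (deriving labels from a text corpus using lexical rules), the following is a minimal sketch in Python. The excerpt does not state the authors' actual rules, so the self-disclosure patterns, the tie-handling policy, and the names GENDER_RULES, annotate_user, and annotate_corpus are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical self-disclosure patterns. The paper's real lexical rules are
# not given in this excerpt; these are placeholders for illustration only.
GENDER_RULES = {
    "male": re.compile(r"\b(?:i am|i'm) a (?:man|guy|male)\b", re.IGNORECASE),
    "female": re.compile(r"\b(?:i am|i'm) a (?:woman|girl|female)\b", re.IGNORECASE),
}


def annotate_user(comments):
    """Return a label for one user, or None when the rules give no clear answer."""
    votes = Counter()
    for text in comments:
        for label, pattern in GENDER_RULES.items():
            votes[label] += len(pattern.findall(text))
    ranked = votes.most_common()
    # Abstain when no rule matched at all.
    if not ranked or ranked[0][1] == 0:
        return None
    # Abstain when the two best labels are tied (assumed policy).
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def annotate_corpus(comments_by_user):
    """Map each user name to a label, dropping users the rules cannot label."""
    labels = {}
    for user, comments in comments_by_user.items():
        label = annotate_user(comments)
        if label is not None:
            labels[user] = label
    return labels
```

Rules of this kind label only the users who disclose the attribute explicitly. In a pipeline like the one the abstract outlines, such labels would then serve as training data for a classifier applied to the remaining users.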

Similar resources

User Migration in Online Social Networks: A Case Study on Reddit During a Period of Community Unrest

Platforms like Reddit have attracted large and vibrant communities, but the individuals in those communities are free to migrate to other platforms at any time. History has borne this out with the mass migration from Slashdot to Digg. The underlying motivations of individuals who migrate between platforms, and the conditions that favor migration online are not well-understood. We examine Reddit...

Developing a New Method in Object Based Classification to Updating Large Scale Maps with Emphasis on Building Feature

As cities expand, updating urban maps for urban planning is important, and its effectiveness depends on the accuracy of information extraction and change detection. Information extraction methods are divided into two groups: Pixel-Based (PB) and Object-Based (OB). OB analysis has overcome the limitations of PB analysis (producing salt-pepper results and features with hole...

Reddit Temporal N-gram Corpus and its Applications on Paraphrase and Semantic Similarity in Social Media using a Topic-based Latent Semantic Analysis

This paper introduces a new large-scale n-gram corpus that is created specifically from social media text. Two distinguishing characteristics of this corpus are its monthly temporal attribute and that it is created from 1.65 billion comments of user-generated text in Reddit. The usefulness of this corpus is exemplified and evaluated by a novel Topic-based Latent Semantic Analysis (TLSA) algorit...

Blogs, Twitter Feeds, and Reddit Comments: Cross-domain Authorship Attribution

Stylometry is a form of authorship attribution that relies on the linguistic information to attribute documents of unknown authorship based on the writing styles of a suspect set of authors. This paper focuses on the cross-domain subproblem where the known and suspect documents differ in the setting in which they were created. Three distinct domains, Twitter feeds, blog entries, and Reddit comm...

From Zoos to Safaris - From Closed-World Enforcement to Open-World Assessment of Privacy

In this paper, we develop a user-centric privacy framework for quantitatively assessing the exposure of personal information in open settings. Our formalization addresses key challenges posed by such open settings, such as the unstructured dissemination of heterogeneous information and the necessity of user- and context-dependent privacy requirements. We propose a new definition of information se...

Publication date: 2015